ifood 期中報告¶
公司簡介¶
iFood 成立於 2011 年,是巴西領先的線上食品外送公司,服務範圍涵蓋巴西、哥倫比亞和墨西哥等地。透過 iFood 的行動應用程式,使用者可以選擇餐廳、查看菜單與價格,並輕鬆下單購買,類似於中國的美团或饿了么等外送平台。iFood 擁有超過 1,200 萬用戶,整合了近 7 萬家餐廳與 12 萬名送餐員,外送服務遍及巴西 500 多個城市。
iFood 致力於構建以「吃」為核心的生態系統,不僅提供食品外送服務,還涉足其他領域,如外送雜貨和藥局服務,甚至是寵物用品。其應用程式簡單易用,成功吸引多元受眾,並在巴西外送市場中占據主導地位。

公司銷售欄位簡介¶
人口統計資訊¶
- 年齡: Age
- 收入: Income
- 家中兒童數量: Kidhome
- 家中青少年數量: Teenhome
- 婚姻狀態:
- 婚姻狀態 - 離婚: marital_Divorced
- 婚姻狀態 - 已婚: marital_Married
- 婚姻狀態 - 單身: marital_Single
- 婚姻狀態 - 同居: marital_Together
- 婚姻狀態 - 喪偶: marital_Widow
- 教育程度:
- 教育程度 - 二級循環: education_2n Cycle
- 教育程度 - 基本: education_Basic
- 教育程度 - 畢業: education_Graduation
- 教育程度 - 碩士: education_Master
- 教育程度 - 博士: education_PhD
- 成為顧客的天數: Customer Days
購買行為¶
- 最近一次購買的天數: Recency
- 葡萄酒花費: MntWines
- 水果花費: MntFruits
- 肉類產品花費: MntMeatProducts
- 魚類產品花費: MntFishProducts
- 甜品花費: MntSweetProducts
- 黃金產品花費: MntGoldProds
- 常規產品花費: MntRegularProds
- 總花費: MntTotal
行銷活動與參與度¶
- 第 1 次行銷活動的接受情況: AcceptedCmp1
- 第 2 次行銷活動的接受情況: AcceptedCmp2
- 第 3 次行銷活動的接受情況: AcceptedCmp3
- 第 4 次行銷活動的接受情況: AcceptedCmp4
- 第 5 次行銷活動的接受情況: AcceptedCmp5
- 接受的行銷活動總數: AcceptedCmpOverall
- 對最後一次行銷活動的回應: Response
購買與聯繫行為¶
- 促銷購買次數: NumDealsPurchases
- 網上購買次數: NumWebPurchases
- 型錄購買次數: NumCatalogPurchases
- 商店購買次數: NumStorePurchases
- 每月網頁訪問次數: NumWebVisitsMonth
- 投訴記錄: Complain
業務成本與收入¶
- 聯繫成本(固定值): Z_CostContact
- 收入(固定值): Z_Revenue
讀取資料¶
file_path = 'ml_project1_data.csv'
df = pl.read_csv(file_path)
df
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | i64 | i64 | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 5524 | 1957 | "Graduation" | "Single" | 58138 | 0 | 0 | "2012-09-04" | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 2174 | 1954 | "Graduation" | "Single" | 46344 | 1 | 1 | "2014-03-08" | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4141 | 1965 | "Graduation" | "Together" | 71613 | 0 | 0 | "2013-08-21" | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 6182 | 1984 | "Graduation" | "Together" | 26646 | 1 | 0 | "2014-02-10" | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 5324 | 1981 | "PhD" | "Married" | 58293 | 1 | 0 | "2014-01-19" | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 10870 | 1967 | "Graduation" | "Married" | 61223 | 0 | 1 | "2013-06-13" | 46 | 709 | 43 | 182 | 42 | 118 | 247 | 2 | 9 | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4001 | 1946 | "PhD" | "Together" | 64014 | 2 | 1 | "2014-06-10" | 56 | 406 | 0 | 30 | 0 | 0 | 8 | 7 | 8 | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 7270 | 1981 | "Graduation" | "Divorced" | 56981 | 0 | 0 | "2014-01-25" | 91 | 908 | 48 | 217 | 32 | 12 | 24 | 1 | 2 | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 8235 | 1956 | "Master" | "Together" | 69245 | 0 | 1 | "2014-01-24" | 8 | 428 | 30 | 214 | 80 | 30 | 61 | 2 | 6 | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 9405 | 1954 | "PhD" | "Married" | 52869 | 1 | 1 | "2012-10-15" | 40 | 84 | 3 | 61 | 2 | 1 | 21 | 3 | 3 | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
資料整理¶
- 檢查缺失 (
null) 值 - 補充匯總性欄位(總銷售額、總購買次數)
檢查是否有缺失值。
df.describe()
| statistic | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | str | str | f64 | f64 | f64 | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "count" | 2240.0 | 2240.0 | "2240" | "2240" | 2216.0 | 2240.0 | 2240.0 | "2240" | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 |
| "null_count" | 0.0 | 0.0 | "0" | "0" | 24.0 | 0.0 | 0.0 | "0" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| "mean" | 5592.159821 | 1968.805804 | null | null | 52247.251354 | 0.444196 | 0.50625 | null | 49.109375 | 303.935714 | 26.302232 | 166.95 | 37.525446 | 27.062946 | 44.021875 | 2.325 | 4.084821 | 2.662054 | 5.790179 | 5.316518 | 0.072768 | 0.074554 | 0.072768 | 0.064286 | 0.013393 | 0.009375 | 3.0 | 11.0 | 0.149107 |
| "std" | 3246.662198 | 11.984069 | null | null | 25173.076661 | 0.538398 | 0.544538 | null | 28.962453 | 336.597393 | 39.773434 | 225.715373 | 54.628979 | 41.280498 | 52.167439 | 1.932238 | 2.778714 | 2.923101 | 3.250958 | 2.426645 | 0.259813 | 0.262728 | 0.259813 | 0.245316 | 0.114976 | 0.096391 | 0.0 | 0.0 | 0.356274 |
| "min" | 0.0 | 1893.0 | "2n Cycle" | "Absurd" | 1730.0 | 0.0 | 0.0 | "2012-07-30" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| "25%" | 2829.0 | 1959.0 | null | null | 35322.0 | 0.0 | 0.0 | null | 24.0 | 24.0 | 1.0 | 16.0 | 3.0 | 1.0 | 9.0 | 1.0 | 2.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| "50%" | 5462.0 | 1970.0 | null | null | 51390.0 | 0.0 | 0.0 | null | 49.0 | 174.0 | 8.0 | 67.0 | 12.0 | 8.0 | 24.0 | 2.0 | 4.0 | 2.0 | 5.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| "75%" | 8427.0 | 1977.0 | null | null | 68487.0 | 1.0 | 1.0 | null | 74.0 | 504.0 | 33.0 | 232.0 | 50.0 | 33.0 | 56.0 | 3.0 | 6.0 | 4.0 | 8.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| "max" | 11191.0 | 1996.0 | "PhD" | "YOLO" | 666666.0 | 2.0 | 2.0 | "2014-06-29" | 99.0 | 1493.0 | 199.0 | 1725.0 | 259.0 | 263.0 | 362.0 | 15.0 | 27.0 | 28.0 | 13.0 | 20.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 11.0 | 1.0 |
發現 Income 有缺失值,佔據整個 dataset 的 1%。直接刪除即可。
df = df.drop_nulls(subset=['Income'])
新增 TotalSales(總銷售額)
df = df.with_columns(
TotalSales=pl.sum_horizontal(
'MntWines',
'MntFruits',
'MntMeatProducts',
'MntFishProducts',
'MntSweetProducts',
'MntGoldProds',
))
df
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | TotalSales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | i64 | i64 | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 5524 | 1957 | "Graduation" | "Single" | 58138 | 0 | 0 | "2012-09-04" | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 1617 |
| 2174 | 1954 | "Graduation" | "Single" | 46344 | 1 | 1 | "2014-03-08" | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 27 |
| 4141 | 1965 | "Graduation" | "Together" | 71613 | 0 | 0 | "2013-08-21" | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 776 |
| 6182 | 1984 | "Graduation" | "Together" | 26646 | 1 | 0 | "2014-02-10" | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 53 |
| 5324 | 1981 | "PhD" | "Married" | 58293 | 1 | 0 | "2014-01-19" | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 422 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 10870 | 1967 | "Graduation" | "Married" | 61223 | 0 | 1 | "2013-06-13" | 46 | 709 | 43 | 182 | 42 | 118 | 247 | 2 | 9 | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 1341 |
| 4001 | 1946 | "PhD" | "Together" | 64014 | 2 | 1 | "2014-06-10" | 56 | 406 | 0 | 30 | 0 | 0 | 8 | 7 | 8 | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 | 444 |
| 7270 | 1981 | "Graduation" | "Divorced" | 56981 | 0 | 0 | "2014-01-25" | 91 | 908 | 48 | 217 | 32 | 12 | 24 | 1 | 2 | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 1241 |
| 8235 | 1956 | "Master" | "Together" | 69245 | 0 | 1 | "2014-01-24" | 8 | 428 | 30 | 214 | 80 | 30 | 61 | 2 | 6 | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 843 |
| 9405 | 1954 | "PhD" | "Married" | 52869 | 1 | 1 | "2012-10-15" | 40 | 84 | 3 | 61 | 2 | 1 | 21 | 3 | 3 | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 172 |
新增 TotalPurchases(總購買次數)
df = df.with_columns(
TotalPurchases=pl.sum_horizontal(
'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases'
))
df
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | TotalSales | TotalPurchases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | i64 | i64 | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 5524 | 1957 | "Graduation" | "Single" | 58138 | 0 | 0 | "2012-09-04" | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 1617 | 25 |
| 2174 | 1954 | "Graduation" | "Single" | 46344 | 1 | 1 | "2014-03-08" | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 27 | 6 |
| 4141 | 1965 | "Graduation" | "Together" | 71613 | 0 | 0 | "2013-08-21" | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 776 | 21 |
| 6182 | 1984 | "Graduation" | "Together" | 26646 | 1 | 0 | "2014-02-10" | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 53 | 8 |
| 5324 | 1981 | "PhD" | "Married" | 58293 | 1 | 0 | "2014-01-19" | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 422 | 19 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 10870 | 1967 | "Graduation" | "Married" | 61223 | 0 | 1 | "2013-06-13" | 46 | 709 | 43 | 182 | 42 | 118 | 247 | 2 | 9 | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 1341 | 18 |
| 4001 | 1946 | "PhD" | "Together" | 64014 | 2 | 1 | "2014-06-10" | 56 | 406 | 0 | 30 | 0 | 0 | 8 | 7 | 8 | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 | 444 | 22 |
| 7270 | 1981 | "Graduation" | "Divorced" | 56981 | 0 | 0 | "2014-01-25" | 91 | 908 | 48 | 217 | 32 | 12 | 24 | 1 | 2 | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 1241 | 19 |
| 8235 | 1956 | "Master" | "Together" | 69245 | 0 | 1 | "2014-01-24" | 8 | 428 | 30 | 214 | 80 | 30 | 61 | 2 | 6 | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 843 | 23 |
| 9405 | 1954 | "PhD" | "Married" | 52869 | 1 | 1 | "2012-10-15" | 40 | 84 | 3 | 61 | 2 | 1 | 21 | 3 | 3 | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 172 | 11 |
探索性分析¶
婚姻情況與教育情況。
import altair as alt
(
df.plot.bar(alt.X("Marital_Status", title="婚姻狀態"), y=alt.Y("count()", title="人數"))
| df.plot.bar(alt.X("Education", title="教育程度"), y=alt.Y("count()", title="人數"))
| df.plot.bar(alt.X("Education", title="教育程度"), alt.Y("count()", title="人數", stack="normalize"), alt.Color("Marital_Status", title="婚姻狀態"))
)
有小孩或青少年的比例。
(
df["Kidhome"]
.value_counts()
.plot.arc(color="Kidhome:N", theta="count")
.properties(title="孩子在家的數量")
| df["Teenhome"]
.value_counts()
.plot.arc(color="Teenhome:N", theta="count")
.properties(title="青少年在家的數量")
)
參與行銷活動的人數。
df.select(
Cmp1="AcceptedCmp1",
Cmp2="AcceptedCmp2",
Cmp3="AcceptedCmp3",
Cmp4="AcceptedCmp4",
Cmp5="AcceptedCmp5",
).sum().transpose(include_header=True, header_name="Campaigns").rename(
{"column_0": "count"}
).plot.bar(
alt.X("count", title="參加次數"),
alt.Y("Campaigns", title="活動"),
alt.Color("Campaigns", legend=None),
).properties(
title="各活動參加人數"
)
顧客年齡分佈。
df = df.with_columns(age=2020 - pl.col("Year_Birth"))
df.plot.bar(alt.X("age", bin=True, title="年齡"), y=alt.Y("count()", title="人數"))
顧客特性說明¶
- 年齡:主要集中在 40 歲以上,40-60 區間尤為最多。
- 孩子:大多沒有孩子,其次是一胎家庭。
- 婚姻狀態:大多已婚,其次是單身。
- 教育程度:大學以上之高學歷居多
初次客與回流客分析¶
假設購買次數大於 1 的為回流客,否則為初次客。
CustomerType = pl.Enum(["First-Time", "Returning"])
df = df.with_columns(
pl.when(pl.col("TotalPurchases") > 1)
.then(pl.lit("Returning"))
.otherwise(pl.lit("First-Time"))
.cast(CustomerType)
.alias("CustomerType")
)
df
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | TotalSales | TotalPurchases | age | CustomerType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | i64 | i64 | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | enum |
| 5524 | 1957 | "Graduation" | "Single" | 58138 | 0 | 0 | "2012-09-04" | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 1617 | 25 | 63 | "Returning" |
| 2174 | 1954 | "Graduation" | "Single" | 46344 | 1 | 1 | "2014-03-08" | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 27 | 6 | 66 | "Returning" |
| 4141 | 1965 | "Graduation" | "Together" | 71613 | 0 | 0 | "2013-08-21" | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 776 | 21 | 55 | "Returning" |
| 6182 | 1984 | "Graduation" | "Together" | 26646 | 1 | 0 | "2014-02-10" | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 53 | 8 | 36 | "Returning" |
| 5324 | 1981 | "PhD" | "Married" | 58293 | 1 | 0 | "2014-01-19" | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 422 | 19 | 39 | "Returning" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 10870 | 1967 | "Graduation" | "Married" | 61223 | 0 | 1 | "2013-06-13" | 46 | 709 | 43 | 182 | 42 | 118 | 247 | 2 | 9 | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 1341 | 18 | 53 | "Returning" |
| 4001 | 1946 | "PhD" | "Together" | 64014 | 2 | 1 | "2014-06-10" | 56 | 406 | 0 | 30 | 0 | 0 | 8 | 7 | 8 | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 | 444 | 22 | 74 | "Returning" |
| 7270 | 1981 | "Graduation" | "Divorced" | 56981 | 0 | 0 | "2014-01-25" | 91 | 908 | 48 | 217 | 32 | 12 | 24 | 1 | 2 | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 1241 | 19 | 39 | "Returning" |
| 8235 | 1956 | "Master" | "Together" | 69245 | 0 | 1 | "2014-01-24" | 8 | 428 | 30 | 214 | 80 | 30 | 61 | 2 | 6 | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 843 | 23 | 64 | "Returning" |
| 9405 | 1954 | "PhD" | "Married" | 52869 | 1 | 1 | "2012-10-15" | 40 | 84 | 3 | 61 | 2 | 1 | 21 | 3 | 3 | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 172 | 11 | 66 | "Returning" |
初次客和回流客的比例。
alt.Chart(df).transform_joinaggregate(total='count(*)').transform_calculate(
pct='1 / datum.total'
).mark_bar().encode(
alt.X('sum(pct):Q', axis=alt.Axis(format='%'), title="佔整體比例"),
alt.Y('CustomerType', title="顧客類型"),
).properties(
title="初次客 / 回頭客比例"
).interactive()
回頭客明顯多於初次客。
RFM 分群¶
# R: Recency, F: Frequency (購買次數), M: Monetary (花費金額)
rfm_df = df.select(Recency="Recency", Frequency="TotalPurchases", Monetary="TotalSales")
rfm_df
| Recency | Frequency | Monetary |
|---|---|---|
| i64 | i64 | i64 |
| 58 | 25 | 1617 |
| 38 | 6 | 27 |
| 26 | 21 | 776 |
| 26 | 8 | 53 |
| 94 | 19 | 422 |
| … | … | … |
| 46 | 18 | 1341 |
| 56 | 22 | 444 |
| 91 | 19 | 1241 |
| 8 | 23 | 843 |
| 40 | 11 | 172 |
將數值切成五等份,計算 RFM 分數,並為客戶打分。
rfm_df = rfm_df.with_columns(
# Recency 分數(越小越好)
RecencyScore=pl.col("Recency").qcut(5, labels=["5", "4", "3", "2", "1"]),
# Frequency 分數(越大越好)
FrequencyScore=pl.col("Frequency").qcut(5, labels=["1", "2", "3", "4", "5"]),
# Monetary 分數(越大越好)
MonetaryScore=pl.col("Monetary").qcut(5, labels=["1", "2", "3", "4", "5"]),
)
rfm_df = rfm_df.with_columns(
CustomerScore=pl.concat_str(
pl.col("RecencyScore"), pl.col("FrequencyScore"), pl.col("MonetaryScore")
))
rfm_df
| Recency | Frequency | Monetary | RecencyScore | FrequencyScore | MonetaryScore | CustomerScore |
|---|---|---|---|---|---|---|
| i64 | i64 | i64 | cat | cat | cat | str |
| 58 | 25 | 1617 | "3" | "5" | "5" | "355" |
| 38 | 6 | 27 | "4" | "1" | "1" | "411" |
| 26 | 21 | 776 | "4" | "4" | "4" | "444" |
| 26 | 8 | 53 | "4" | "2" | "1" | "421" |
| 94 | 19 | 422 | "1" | "4" | "3" | "143" |
| … | … | … | … | … | … | … |
| 46 | 18 | 1341 | "3" | "4" | "5" | "345" |
| 56 | 22 | 444 | "3" | "4" | "3" | "343" |
| 91 | 19 | 1241 | "1" | "4" | "5" | "145" |
| 8 | 23 | 843 | "5" | "5" | "4" | "554" |
| 40 | 11 | 172 | "3" | "2" | "2" | "322" |
篩選出最有價值的族群。
best_customers = rfm_df.filter(pl.col("CustomerScore") == "555")
best_customers
| Recency | Frequency | Monetary | RecencyScore | FrequencyScore | MonetaryScore | CustomerScore |
|---|---|---|---|---|---|---|
| i64 | i64 | i64 | cat | cat | cat | str |
| 2 | 29 | 1693 | "5" | "5" | "5" | "555" |
| 19 | 32 | 1274 | "5" | "5" | "5" | "555" |
| 14 | 24 | 1581 | "5" | "5" | "5" | "555" |
| 17 | 29 | 1804 | "5" | "5" | "5" | "555" |
| 1 | 35 | 1485 | "5" | "5" | "5" | "555" |
| … | … | … | … | … | … | … |
| 13 | 28 | 1730 | "5" | "5" | "5" | "555" |
| 6 | 25 | 1376 | "5" | "5" | "5" | "555" |
| 2 | 23 | 2092 | "5" | "5" | "5" | "555" |
| 6 | 23 | 1518 | "5" | "5" | "5" | "555" |
| 3 | 23 | 1282 | "5" | "5" | "5" | "555" |
最有價值群的 RFM 平均數。
best_customers.select(
pl.col("Recency").mean(), pl.col("Frequency").mean(), pl.col("Monetary").mean()
)
| Recency | Frequency | Monetary |
|---|---|---|
| f64 | f64 | f64 |
| 9.409091 | 26.772727 | 1545.772727 |
此群佔總人數多少百分比?
print("Best customers: {:.3%}".format(len(best_customers) / len(rfm_df)))
Best customers: 0.993%
分出 常貴客、一次客、新顧客、流失客。
rfm_df = rfm_df.with_columns(
RecencyLevel=pl.col("Recency").qcut(2, labels=["Low", "High"]),
FrequencyLevel=pl.col("Frequency").cut([1], labels=["Low", "High"]),
)
rfm_df = rfm_df.with_columns(
pl.when((pl.col("RecencyLevel") == "Low") & (pl.col("FrequencyLevel") == "Low"))
.then(pl.lit("流失客"))
.when((pl.col("RecencyLevel") == "Low") & (pl.col("FrequencyLevel") == "High"))
.then(pl.lit("新顧客"))
.when((pl.col("RecencyLevel") == "High") & (pl.col("FrequencyLevel") == "Low"))
.then(pl.lit("先前客"))
.when((pl.col("RecencyLevel") == "High") & (pl.col("FrequencyLevel") == "High"))
.then(pl.lit("常貴客"))
.alias("CustomerLevel")
)
rfm_df
| Recency | Frequency | Monetary | RecencyScore | FrequencyScore | MonetaryScore | CustomerScore | RecencyLevel | FrequencyLevel | CustomerLevel |
|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | cat | cat | cat | str | cat | cat | str |
| 58 | 25 | 1617 | "3" | "5" | "5" | "355" | "High" | "High" | "常貴客" |
| 38 | 6 | 27 | "4" | "1" | "1" | "411" | "Low" | "High" | "新顧客" |
| 26 | 21 | 776 | "4" | "4" | "4" | "444" | "Low" | "High" | "新顧客" |
| 26 | 8 | 53 | "4" | "2" | "1" | "421" | "Low" | "High" | "新顧客" |
| 94 | 19 | 422 | "1" | "4" | "3" | "143" | "High" | "High" | "常貴客" |
| … | … | … | … | … | … | … | … | … | … |
| 46 | 18 | 1341 | "3" | "4" | "5" | "345" | "Low" | "High" | "新顧客" |
| 56 | 22 | 444 | "3" | "4" | "3" | "343" | "High" | "High" | "常貴客" |
| 91 | 19 | 1241 | "1" | "4" | "5" | "145" | "High" | "High" | "常貴客" |
| 8 | 23 | 843 | "5" | "5" | "4" | "554" | "Low" | "High" | "新顧客" |
| 40 | 11 | 172 | "3" | "2" | "2" | "322" | "Low" | "High" | "新顧客" |
rfm_df.plot.bar(alt.X("count()", title="人數"), alt.Y("CustomerLevel", title="顧客等級")).properties(title="顧客等級分佈")
K-means¶
K-means 自訂組數分群:將數值標準化。K-means 前最好將特徵值先標準化,免得各值差異大,分群會被值大的特徵值拉走。
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
rfm_df_numeric = rfm_df.select("Recency", "Frequency", "Monetary")
normalized_df = pl.DataFrame(
scaler.fit_transform(rfm_df_numeric.to_numpy()), schema=rfm_df_numeric.columns
)
normalized_df
| Recency | Frequency | Monetary |
|---|---|---|
| f64 | f64 | f64 |
| 0.310532 | 1.319446 | 1.675488 |
| -0.380509 | -1.157987 | -0.962358 |
| -0.795134 | 0.797881 | 0.28025 |
| -0.795134 | -0.897205 | -0.919224 |
| 1.554407 | 0.537099 | -0.307044 |
| … | … | … |
| -0.104093 | 0.406708 | 1.217598 |
| 0.241428 | 0.928273 | -0.270546 |
| 1.450751 | 0.537099 | 1.051696 |
| -1.417072 | 1.058664 | 0.391404 |
| -0.311405 | -0.506031 | -0.7218 |
執行 K-means。使用 Silhouette 值決定最好的分群。
早期的版本是使用肘部圖 (Elbow Method),但效果不佳,故只畫圖而不用肘部圖判斷,改用
silhouette_score。^1
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
def find_best_kmeans(normalized_df: pl.DataFrame) -> tuple[KMeans, list[float]]:
best_kmeans: KMeans | None = None
silhouette_avg_max = 0
inertias = list[float]()
for n in range(3, 12):
kmeans = KMeans(n_clusters=n, random_state=6).fit(normalized_df)
silhouette_avg = silhouette_score(normalized_df.to_numpy(), kmeans.labels_, random_state=6)
inertias.append(kmeans.inertia_)
picked = silhouette_avg > silhouette_avg_max
if picked:
best_kmeans = kmeans
silhouette_avg_max = silhouette_avg
print(f"n_clusters: {n}, silhouette_score: {silhouette_avg}, picked: {picked}")
assert best_kmeans is not None
return best_kmeans, inertias
kmeans, interias = find_best_kmeans(normalized_df)
n_clusters: 3, silhouette_score: 0.382027566450604, picked: True n_clusters: 4, silhouette_score: 0.3772275072151424, picked: False n_clusters: 5, silhouette_score: 0.36749151515577905, picked: False n_clusters: 6, silhouette_score: 0.32933115669120455, picked: False n_clusters: 7, silhouette_score: 0.3316218205019696, picked: False n_clusters: 8, silhouette_score: 0.3413972324456658, picked: False n_clusters: 9, silhouette_score: 0.3356077105119019, picked: False n_clusters: 10, silhouette_score: 0.336612138016919, picked: False n_clusters: 11, silhouette_score: 0.33650869759565205, picked: False
alt.Chart(pl.DataFrame({"n_clusters": list(range(3, 12)), "inertia": interias})).mark_line(
point=True
).encode(alt.X("n_clusters", title="n_clusters"), alt.Y("inertia", title="inertia")).properties(
title="Elbow Method for Optimal K"
)
對各群上標籤。
rfm_df = rfm_df.with_columns(Cluster=pl.Series(kmeans.labels_))
rfm_df
| Recency | Frequency | Monetary | RecencyScore | FrequencyScore | MonetaryScore | CustomerScore | RecencyLevel | FrequencyLevel | CustomerLevel | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | cat | cat | cat | str | cat | cat | str | i32 |
| 58 | 25 | 1617 | "3" | "5" | "5" | "355" | "High" | "High" | "常貴客" | 2 |
| 38 | 6 | 27 | "4" | "1" | "1" | "411" | "Low" | "High" | "新顧客" | 1 |
| 26 | 21 | 776 | "4" | "4" | "4" | "444" | "Low" | "High" | "新顧客" | 0 |
| 26 | 8 | 53 | "4" | "2" | "1" | "421" | "Low" | "High" | "新顧客" | 1 |
| 94 | 19 | 422 | "1" | "4" | "3" | "143" | "High" | "High" | "常貴客" | 2 |
| … | … | … | … | … | … | … | … | … | … | … |
| 46 | 18 | 1341 | "3" | "4" | "5" | "345" | "Low" | "High" | "新顧客" | 0 |
| 56 | 22 | 444 | "3" | "4" | "3" | "343" | "High" | "High" | "常貴客" | 2 |
| 91 | 19 | 1241 | "1" | "4" | "5" | "145" | "High" | "High" | "常貴客" | 2 |
| 8 | 23 | 843 | "5" | "5" | "4" | "554" | "Low" | "High" | "新顧客" | 0 |
| 40 | 11 | 172 | "3" | "2" | "2" | "322" | "Low" | "High" | "新顧客" | 1 |
找到各群中心點並繪製成散佈圖。
kmeans_center = pl.DataFrame(
scaler.inverse_transform(kmeans.cluster_centers_), schema=normalized_df.schema
).with_row_index()
kmeans_center
| index | Recency | Frequency | Monetary |
|---|---|---|---|
| u32 | f64 | f64 | f64 |
| 0 | 21.95073 | 21.286496 | 1047.507299 |
| 1 | 49.199255 | 8.147114 | 109.060521 |
| 2 | 73.641414 | 21.146465 | 1101.20202 |
kmeans_center.plot.scatter(x="Frequency", y="Monetary", size="Recency", color="index:N").properties(
title="Frequency vs Monetary vs Recency"
)
第 2 群相對是最有價值的。列出第 2 群的 RFM 平均數。
best_customers = rfm_df.filter(pl.col("Cluster") == 2)
best_customers.select("Recency", "Frequency", "Monetary").mean()
| Recency | Frequency | Monetary |
|---|---|---|
| f64 | f64 | f64 |
| 73.555369 | 21.172819 | 1101.409396 |
查看這一群的 RFM 標籤分佈。
best_customers.plot.bar(alt.X("count()", title="人數"), alt.Y("CustomerLevel", title="顧客等級")).properties(title="顧客等級分佈")
各群特徵描述¶
看看每群人數。
rfm_df.plot.arc(theta=alt.Theta("count()", title="人數"), color=alt.Color("Cluster:N", title="分群")).properties(title="分群分佈")
看看每群購買東西的量、收入以及平均上次造訪天數。
columns_to_include = [
'MntWines',
'MntFruits',
'MntMeatProducts',
'MntFishProducts',
'MntSweetProducts',
'MntGoldProds',
'Income',
'Recency',
"Cluster",
]
df = df.with_columns(Cluster=rfm_df["Cluster"])
cluster_summary = df.select(columns_to_include).group_by("Cluster").mean().sort("Cluster")
cluster_summary
| Cluster | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | Income | Recency |
|---|---|---|---|---|---|---|---|---|
| i32 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 0 | 529.065934 | 46.641026 | 295.289377 | 63.498168 | 44.818681 | 67.771062 | 66794.153846 | 21.855311 |
| 1 | 47.638734 | 5.731844 | 25.085661 | 8.529795 | 5.610801 | 16.463687 | 36460.982309 | 49.199255 |
| 2 | 563.840604 | 44.937919 | 305.189597 | 66.399329 | 49.327181 | 71.714765 | 67367.795302 | 73.555369 |
看看每一群購買的東西類型。
product_sales = (
df.group_by("Cluster")
.agg(
Wines=pl.sum("MntWines"),
Fruits=pl.sum("MntFruits"),
MeatProducts=pl.sum("MntMeatProducts"),
FishProducts=pl.sum("MntFishProducts"),
SweetProducts=pl.sum("MntSweetProducts"),
GoldProds=pl.sum("MntGoldProds"),
)
.sort("Cluster")
.drop("Cluster")
.transpose(
include_header=True,
header_name="Product",
column_names=["Cluster 0", "Cluster 1", "Cluster 2"],
))
(
product_sales.plot.bar(alt.X("Cluster 0"), alt.Y("Product"))
& product_sales.plot.bar(alt.X("Cluster 1"), alt.Y("Product"))
& product_sales.plot.bar(alt.X("Cluster 2"), alt.Y("Product"))
)
每一群參與活動的比例。
product_sales = (
df.group_by("Cluster")
.agg(
pl.sum("AcceptedCmp1"),
pl.sum("AcceptedCmp2"),
pl.sum("AcceptedCmp3"),
pl.sum("AcceptedCmp4"),
pl.sum("AcceptedCmp5"),
)
.sort("Cluster")
.drop("Cluster")
.transpose(
include_header=True,
header_name="Product",
column_names=["Cluster 0", "Cluster 1", "Cluster 2"],
))
(
product_sales.plot.bar(alt.X("Cluster 0"), alt.Y("Product"))
& product_sales.plot.bar(alt.X("Cluster 1"), alt.Y("Product"))
& product_sales.plot.bar(alt.X("Cluster 2"), alt.Y("Product"))
)
每一群的孩子數量。
df.group_by("Cluster").agg(pl.sum("Kidhome"), pl.sum("Teenhome")).sort("Cluster").unpivot(
index="Cluster"
).plot.bar("Cluster:N", "value", xOffset="variable", color="variable")
每一群的收入分佈。
df.group_by("Cluster").agg(pl.sum("Income")).sort("Cluster").plot.scatter("Income", "Cluster:N")
每一群的教育程度。
df.select("Cluster", "Education").plot.bar(
alt.X("Cluster:N"),
alt.Y("count()", title="人數"),
"Education",
xOffset="Education",
)
分類與預測¶
預測每一群會參與接下來行銷活動的機率?¶
normalized_df = df.with_columns(Cluster=rfm_df["Cluster"])
normalized_df
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | TotalSales | TotalPurchases | age | CustomerType | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | i64 | i64 | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | enum | i32 |
| 5524 | 1957 | "Graduation" | "Single" | 58138 | 0 | 0 | "2012-09-04" | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 1617 | 25 | 63 | "Returning" | 2 |
| 2174 | 1954 | "Graduation" | "Single" | 46344 | 1 | 1 | "2014-03-08" | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 27 | 6 | 66 | "Returning" | 1 |
| 4141 | 1965 | "Graduation" | "Together" | 71613 | 0 | 0 | "2013-08-21" | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 776 | 21 | 55 | "Returning" | 0 |
| 6182 | 1984 | "Graduation" | "Together" | 26646 | 1 | 0 | "2014-02-10" | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 53 | 8 | 36 | "Returning" | 1 |
| 5324 | 1981 | "PhD" | "Married" | 58293 | 1 | 0 | "2014-01-19" | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 422 | 19 | 39 | "Returning" | 2 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 10870 | 1967 | "Graduation" | "Married" | 61223 | 0 | 1 | "2013-06-13" | 46 | 709 | 43 | 182 | 42 | 118 | 247 | 2 | 9 | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 1341 | 18 | 53 | "Returning" | 0 |
| 4001 | 1946 | "PhD" | "Together" | 64014 | 2 | 1 | "2014-06-10" | 56 | 406 | 0 | 30 | 0 | 0 | 8 | 7 | 8 | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 | 444 | 22 | 74 | "Returning" | 2 |
| 7270 | 1981 | "Graduation" | "Divorced" | 56981 | 0 | 0 | "2014-01-25" | 91 | 908 | 48 | 217 | 32 | 12 | 24 | 1 | 2 | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 1241 | 19 | 39 | "Returning" | 2 |
| 8235 | 1956 | "Master" | "Together" | 69245 | 0 | 1 | "2014-01-24" | 8 | 428 | 30 | 214 | 80 | 30 | 61 | 2 | 6 | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 843 | 23 | 64 | "Returning" | 0 |
| 9405 | 1954 | "PhD" | "Married" | 52869 | 1 | 1 | "2012-10-15" | 40 | 84 | 3 | 61 | 2 | 1 | 21 | 3 | 3 | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 172 | 11 | 66 | "Returning" | 1 |
首先將 Education, Marital_Status 等離散資料轉換為 one-hot encoding。
one_hot_df = df.select(
"Education",
"Marital_Status",
"Kidhome",
"Teenhome",
).to_dummies()
one_hot_df
| Education_2n Cycle | Education_Basic | Education_Graduation | Education_Master | Education_PhD | Marital_Status_Absurd | Marital_Status_Alone | Marital_Status_Divorced | Marital_Status_Married | Marital_Status_Single | Marital_Status_Together | Marital_Status_Widow | Marital_Status_YOLO | Kidhome_0 | Kidhome_1 | Kidhome_2 | Teenhome_0 | Teenhome_1 | Teenhome_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
數值資料則進行標準化。
scaler = StandardScaler()
fields = ["Income", "Recency", "TotalPurchases", "TotalSales", "age"]
continuous_df = df.select(fields)
normalized_continuous_df = pl.DataFrame(
scaler.fit_transform(continuous_df.to_numpy()), schema=fields
)
normalized_continuous_df
| Income | Recency | TotalPurchases | TotalSales | age |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | f64 |
| 0.234063 | 0.310532 | 1.319446 | 1.675488 | 0.986443 |
| -0.234559 | -0.380509 | -1.157987 | -0.962358 | 1.236801 |
| 0.769478 | -0.795134 | 0.797881 | 0.28025 | 0.318822 |
| -1.017239 | -0.795134 | -0.897205 | -0.919224 | -1.266777 |
| 0.240221 | 1.554407 | 0.537099 | -0.307044 | -1.01642 |
| … | … | … | … | … |
| 0.356642 | -0.104093 | 0.406708 | 1.217598 | 0.151917 |
| 0.467539 | 0.241428 | 0.928273 | -0.270546 | 1.904422 |
| 0.188091 | 1.450751 | 0.537099 | 1.051696 | -1.01642 |
| 0.675388 | -1.417072 | 1.058664 | 0.391404 | 1.069896 |
| 0.024705 | -0.311405 | -0.506031 | -0.7218 | 1.236801 |
合併這些標準化後的資料,加上我們用 K-means 分的群。
predictable_df = pl.concat([
normalized_continuous_df,
one_hot_df,
df.select("Cluster", "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4"),
],
how="horizontal",
)
predictable_df
| Income | Recency | TotalPurchases | TotalSales | age | Education_2n Cycle | Education_Basic | Education_Graduation | Education_Master | Education_PhD | Marital_Status_Absurd | Marital_Status_Alone | Marital_Status_Divorced | Marital_Status_Married | Marital_Status_Single | Marital_Status_Together | Marital_Status_Widow | Marital_Status_YOLO | Kidhome_0 | Kidhome_1 | Kidhome_2 | Teenhome_0 | Teenhome_1 | Teenhome_2 | Cluster | AcceptedCmp1 | AcceptedCmp2 | AcceptedCmp3 | AcceptedCmp4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| f64 | f64 | f64 | f64 | f64 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | i32 | i64 | i64 | i64 | i64 |
| 0.234063 | 0.310532 | 1.319446 | 1.675488 | 0.986443 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| -0.234559 | -0.380509 | -1.157987 | -0.962358 | 1.236801 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0.769478 | -0.795134 | 0.797881 | 0.28025 | 0.318822 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| -1.017239 | -0.795134 | -0.897205 | -0.919224 | -1.266777 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0.240221 | 1.554407 | 0.537099 | -0.307044 | -1.01642 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 0.356642 | -0.104093 | 0.406708 | 1.217598 | 0.151917 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.467539 | 0.241428 | 0.928273 | -0.270546 | 1.904422 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 |
| 0.188091 | 1.450751 | 0.537099 | 1.051696 | -1.01642 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 1 |
| 0.675388 | -1.417072 | 1.058664 | 0.391404 | 1.069896 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.024705 | -0.311405 | -0.506031 | -0.7218 | 1.236801 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |